library(tidyverse)
library(caret)
library(nnet)
library(plotly)
library(randomForest)

Statistical modeling and prediction

1. Effect of sea water temperature on fish in the Great Barrier Reef

1.1 Exploratory data analysis and visualization

In this analysis, we will examine the effects of water temperature on the number of unique fish species observed in the Great Barrier Reef from 1997 to 2011.

Originally, we wanted to examine a relationship between water temperature and coral cover. However, we quickly saw that, when visualized, there does not seem to be a linear relation and we did not want to fit a mis-specified model.

Merge between temperature and coral cover data sets

temp <- read_csv("cleaned_data/aims_temperatures.csv")
## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   avg_water_temp_2.0m_flat_site = col_double(),
##   avg_water_temp_9.0m_slope_site = col_double()
## )
coral_cover <- read_csv("cleaned_data/coral_cover.csv")
## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   mean_live_coral_cover_percent = col_double(),
##   lower_conf_int = col_double(),
##   upper_conf_int = col_double(),
##   conf_int_span = col_double()
## )
temp_coral_cover <- merge(x=temp, y=coral_cover, by="date")
temp_coral_cover %>%
    ggplot(aes(avg_water_temp_2.0m_flat_site, mean_live_coral_cover_percent)) + 
    geom_point()

temp_coral_cover %>%
    ggplot(aes(avg_water_temp_9.0m_slope_site, mean_live_coral_cover_percent)) + 
    geom_point()

As we can see, there is no linear relationship. While there are 2 or 3 clusters of coral cover percentage values, they seem to be pretty consistent across the range of water temperatures at 2.0m and 9.0 below sea level.

We also suspect that increase in water temperature could also affect diversity in sea life. So, we will examine the relationship between water temperature and the number of unique species of fish observed in the Great Barrier Ree. First, we can do some basic visualization by plotting the relationship between water temperatures at depth 2.0m and 9.0m with the number of unique fish species observedin the Great Barrier Reef.

fish <- read_csv("cleaned_data/fish_species_counts.csv")
## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   num_of_species = col_double()
## )
fish_temp <- merge(x=temp, y=fish, by="date")

# water temp at 2.0m
fish_temp %>% ggplot(aes(avg_water_temp_2.0m_flat_site, num_of_species)) + geom_point()

# water temp at92.0m
fish_temp %>% ggplot(aes(avg_water_temp_9.0m_slope_site, num_of_species)) + geom_point()

There seems to be a bit of negative linear relationship going on, so we will fit a linear model examining the number of unique species discovered in relation to rising temperature.

1.2 Linear regression

Now, we can split our data set into train and test sets, using 0.6 to partition our data. Our outcome is the mean coral cover percentage and water temperature is our covariate. We will fit 2 linear regression models: one examining effect of water temperature at 2.0m and the other examining the effect of temperature at 9.0m.

train_index <- createDataPartition(y=fish_temp$num_of_species, times=1, p = 0.6, list=FALSE)

train_set <- fish_temp[train_index, ]
test_set <- fish_temp[-train_index, ]

Fit linear regression model:

fish_temp_2.0m <- lm(num_of_species ~ avg_water_temp_2.0m_flat_site, data=train_set)
summary(fish_temp_2.0m)
## 
## Call:
## lm(formula = num_of_species ~ avg_water_temp_2.0m_flat_site, 
##     data = train_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.577  -9.236   0.413   9.609  38.877 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   140.0168    16.2550   8.614 1.98e-15 ***
## avg_water_temp_2.0m_flat_site  -2.8334     0.5914  -4.791 3.20e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.8 on 203 degrees of freedom
## Multiple R-squared:  0.1016, Adjusted R-squared:  0.09717 
## F-statistic: 22.96 on 1 and 203 DF,  p-value: 3.196e-06
fish_temp_9.0m <- lm(num_of_species ~ avg_water_temp_9.0m_slope_site, data=train_set)
summary(fish_temp_9.0m)
## 
## Call:
## lm(formula = num_of_species ~ avg_water_temp_9.0m_slope_site, 
##     data = train_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.401  -8.931   0.906   9.523  39.400 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    142.5193    16.7272   8.520 3.59e-15 ***
## avg_water_temp_9.0m_slope_site  -2.9379     0.6114  -4.805 3.00e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.8 on 203 degrees of freedom
## Multiple R-squared:  0.1021, Adjusted R-squared:  0.09771 
## F-statistic: 23.09 on 1 and 203 DF,  p-value: 3e-06

We see that the models are very similar in results. The coefficient with covariate 2.0m water temperature and 9.0 water temperature is -2.56 and -2.61, respectively.

Although we should not expect a major difference between how each of the two models performs, let’s compare them anyways to assess which water temperature depth is a better predictor of unique fish species observed in the Great Barrier Reef.

pred_2.0m <- predict(fish_temp_2.0m, test_set)
pred_9.0m <- predict(fish_temp_9.0m, test_set)

postResample(pred = pred_2.0m, obs = test_set$num_of_species)
##        RMSE    Rsquared         MAE 
## 15.73121376  0.03996665 11.87241580
postResample(pred = pred_9.0m, obs = test_set$num_of_species)
##        RMSE    Rsquared         MAE 
## 15.67723856  0.04565415 11.82100056

As expected, results are very similar.

We can assess this visually to confirm our results.

# water temp at 2.0m
test_set %>% 
    ggplot(aes(avg_water_temp_2.0m_flat_site, num_of_species)) + 
    geom_point() +
    geom_abline(intercept=fish_temp_2.0m$coefficients[1], slope=fish_temp_2.0m$coefficients[2], col="red")

# water temp at 9.0m
test_set %>% 
    ggplot(aes(avg_water_temp_9.0m_slope_site, num_of_species)) + 
    geom_point() +
    geom_abline(intercept=fish_temp_9.0m$coefficients[1], slope=fish_temp_9.0m$coefficients[2], col="blue")

They both perform very similarly, and choosing either water temperature as our predictor will yield similar results.



2. Classification of sea grass species in the Great Barrier Reef from 1999 - 2003.

Continuing on from sea life diversity, we have data on presence or absence of certain seagrass species in the Great Barrier Reef from 1999 to 2003. Let’s try to build a classifier to determine how location, and types of sediment and seabed may predict presence of certain sea grass.

As written in the data wrangling and cleaning RMarkdown/HTML file, we chose 4 species to classify: Cymodocea Serrulata, Syringodium Isoetifolium, Thalassia Hemprichii, and Zostera Muelleri (subspecies Capricorni).

2.1 Data exploratory analysis and visualization

seagrass <- read.csv("cleaned_data/seagrass_classification_data.csv", as.is =TRUE)

seagrass$SPECIES <- as.factor(seagrass$SPECIES)
seagrass$SEDIMENT <- as.factor(seagrass$SEDIMENT)
seagrass$TIDAL <- as.factor(seagrass$TIDAL)

head(seagrass)
summary(seagrass)
##        SPECIES         LATITUDE        LONGITUDE         DEPTH        
##  C_SERRULAT: 1256   Min.   :-24.20   Min.   :142.5   Min.   : 0.0000  
##  S_ISOETIFO:  101   1st Qu.:-23.79   1st Qu.:146.8   1st Qu.: 0.0000  
##  T_HEMPRICH:  823   Median :-22.41   Median :150.5   Median : 0.0000  
##  Z_CAPIRCOR:10201   Mean   :-21.06   Mean   :149.0   Mean   : 0.9253  
##                     3rd Qu.:-19.17   3rd Qu.:151.3   3rd Qu.: 1.2279  
##                     Max.   :-10.75   Max.   :151.9   Max.   :29.3583  
##                                                                       
##    SEDIMENT           TIDAL     
##  Gravel:   1   intertidal:7433  
##  Mud   :8969   subtidal  :4948  
##  Reef  :  63                    
##  Rock  :  66                    
##  Rubble: 107                    
##  Sand  :3151                    
##  Shell :  24

First we plot sea grass according to location (latitude and longitude). Then we will add a third axis (depth) to visualize this in 3-dimensions using plotly package. Since depth is measured in meters below sea level, we visualize this in negative values.

seagrass %>%  ggplot() + geom_point(aes(x=LATITUDE, y=LONGITUDE, color=SPECIES))

plot_ly(seagrass, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

We can also see how our categorical predictors relate to sea grass species.

seagrass %>% 
    ggplot(aes(SEDIMENT, fill=SPECIES)) + geom_bar(width=.5, position = "dodge")

seagrass %>% 
    ggplot(aes(TIDAL, fill=SPECIES)) + geom_bar(width=.5, position = "dodge")



2.2 Random forest

We will first use random forest to build a classifier and then use a multinomial logistic regression model, and compare the two.

Let’s first partition our data set into train and test sets. Since we have a lot more data here than in the linear regression model, we will partition it by 75%-25%.

# test-train split
seagrass_train_ind <- createDataPartition(y = seagrass$SPECIES, p=0.75, list=FALSE)

train_set <- seagrass[seagrass_train_ind, ]
test_set <- seagrass[-seagrass_train_ind, ]
rf_fit <- randomForest(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT + TIDAL, 
                       data=train_set,
                       mtry = 2)
rf_fit
## 
## Call:
##  randomForest(formula = SPECIES ~ LATITUDE + LONGITUDE + DEPTH +      SEDIMENT + TIDAL, data = train_set, mtry = 2) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 4.61%
## Confusion matrix:
##            C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR class.error
## C_SERRULAT        712         15         10        205  0.24416136
## S_ISOETIFO         32         28          5         11  0.63157895
## T_HEMPRICH         45          2        558         13  0.09708738
## Z_CAPIRCOR         80          5          5       7561  0.01176317
rf_pred <- predict(rf_fit, newdata = test_set, type = "response")
confusionMatrix(table(pred = rf_pred, true = test_set$SPECIES))
## Confusion Matrix and Statistics
## 
##             true
## pred         C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
##   C_SERRULAT        227         12         18         22
##   S_ISOETIFO          3          7          0          1
##   T_HEMPRICH          6          1        185          1
##   Z_CAPIRCOR         78          5          2       2526
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9518          
##                  95% CI : (0.9437, 0.9591)
##     No Information Rate : 0.8242          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8346          
##                                           
##  Mcnemar's Test P-Value : 2.089e-08       
## 
## Statistics by Class:
## 
##                      Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity                    0.72293          0.280000           0.90244
## Specificity                    0.98129          0.998697           0.99723
## Pos Pred Value                 0.81362          0.636364           0.95855
## Neg Pred Value                 0.96909          0.994162           0.99311
## Prevalence                     0.10149          0.008080           0.06626
## Detection Rate                 0.07337          0.002262           0.05979
## Detection Prevalence           0.09017          0.003555           0.06238
## Balanced Accuracy              0.85211          0.639348           0.94983
##                      Class: Z_CAPIRCOR
## Sensitivity                     0.9906
## Specificity                     0.8438
## Pos Pred Value                  0.9674
## Neg Pred Value                  0.9503
## Prevalence                      0.8242
## Detection Rate                  0.8164
## Detection Prevalence            0.8439
## Balanced Accuracy               0.9172

We see that our classification model works quite well, especially for T_HEMPRICH and Z_CAPRICOR, which have 85%+ sensitivity and specificity. However, we got quite a low sensitivity for S_ISOETIFO. Recall to our data wrangling portion that S_ISOETIFO had only about 100 “Yes” observations. Since we had a small sample size for S_ISOETIFO relative to the other 3 seagrass species, this may have contributed to the low sensitivity.

We can visually assess our predicted values with true values of species to see how our model performed.

# true values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
# predicted values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~rf_pred, type="scatter3d", mode="markers")

For sediment type:

test_set %>% 
    ggplot(aes(SEDIMENT, fill=SPECIES)) + geom_bar(width=.5, position = "dodge")

test_set %>% 
    ggplot(aes(SEDIMENT, fill=rf_pred)) + geom_bar(width=.5, position = "dodge")

For seabed type:

test_set %>% 
    ggplot(aes(TIDAL, fill=SPECIES)) + geom_bar(width=.5, position = "dodge")

test_set %>% 
    ggplot(aes(TIDAL, fill=rf_pred)) + geom_bar(width=.5, position = "dodge")

Let’s examine variable importance.

variable_importance <- importance(rf_fit)
tmp <- data_frame(feature = rownames(variable_importance),
                  Gini = variable_importance[,1]) %>%
                  arrange(desc(Gini))
## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
tmp

We see that longitude and latitude were very predictive of presence of seagrass followed by depth from sea level. The types of sediment and seabed (intertidal or subtidal seabed) are not very good predictors. Thus, it seems that the location of where the sea grass was discovered matters more than the various ocean floor properties.

tmp %>% ggplot(aes(x=reorder(feature, Gini), y=Gini)) + 
  geom_bar(stat='identity') +
  coord_flip() + xlab("Feature") +
  theme(axis.text=element_text(size=8))



2.3 Multinomial logistic regression

We can now try a multinomial logistic regression model to see how it compares to random forest. We will use the nnet package.

The logistic regression model is as follows:

multinom_fit <- multinom(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT, data=train_set)
## # weights:  44 (30 variable)
## initial  value 12874.515732 
## iter  10 value 3231.438784
## iter  20 value 2579.045689
## iter  30 value 2488.644590
## iter  40 value 2418.828498
## iter  50 value 2403.582568
## iter  60 value 2377.804660
## iter  70 value 2377.669690
## iter  80 value 2377.612183
## iter  90 value 2375.572351
## iter 100 value 2375.518392
## final  value 2375.518392 
## stopped after 100 iterations
summary(multinom_fit)
## Call:
## multinom(formula = SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT, 
##     data = train_set)
## 
## Coefficients:
##            (Intercept)  LATITUDE LONGITUDE        DEPTH SEDIMENTMud
## S_ISOETIFO   -346.4807 2.2876720  3.011165 -0.001924993   -56.98242
## T_HEMPRICH   -137.8337 1.9013715  1.334503 -0.320212571   -25.95114
## Z_CAPIRCOR   -175.5349 0.6886591  1.486399 -0.830635230   -27.10451
##            SEDIMENTReef SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO    -55.78712    -56.71262      -65.63158    -56.06705     -55.29990
## T_HEMPRICH    -22.69174    -21.99150      -21.58262    -22.94339     -22.67330
## Z_CAPIRCOR    -30.90690    -29.86489      -31.21765    -28.01992     -28.42101
## 
## Std. Errors:
##            (Intercept)   LATITUDE   LONGITUDE      DEPTH SEDIMENTMud
## S_ISOETIFO 0.002745115 0.05727101 0.007931424 0.02544938   0.4149306
## T_HEMPRICH 0.001313935 0.05666575 0.007251568 0.04008419   0.2994008
## Z_CAPIRCOR 0.002608711 0.02894079 0.004613846 0.02963196   0.3500373
##            SEDIMENTReef SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO    0.7324316    0.9353767   2.028653e-05    0.3841211     0.9586960
## T_HEMPRICH    0.5047530    0.6375312   4.731509e-01    0.2702049     0.9276766
## Z_CAPIRCOR    0.9410291    0.7697505   1.083037e+00    0.3521422     0.8287129
## 
## Residual Deviance: 4751.037 
## AIC: 4805.037

Relative risk ratios where reference group is C_SERRULAT

exp(coef(multinom_fit))
##              (Intercept) LATITUDE LONGITUDE     DEPTH  SEDIMENTMud SEDIMENTReef
## S_ISOETIFO 3.352302e-151 9.851976 20.311041 0.9980769 1.789980e-25 5.915087e-25
## T_HEMPRICH  1.379076e-60 6.695071  3.798108 0.7259947 5.364941e-12 1.396699e-10
## Z_CAPIRCOR  5.836702e-77 1.991044  4.421147 0.4357724 1.693021e-12 3.778359e-14
##            SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO 2.344337e-25   3.137364e-29 4.470860e-25  9.628484e-25
## T_HEMPRICH 2.813269e-10   4.234346e-10 1.085953e-10  1.422695e-10
## Z_CAPIRCOR 1.071129e-13   2.769142e-14 6.778028e-13  4.538492e-13
# predicted probabilities
predicted_prob <- predict(multinom_fit, newdata=test_set, type="probs")

# predicted classes
predicted_class <- predict(multinom_fit, newdata=test_set, type="class")

confusionMatrix(data = predicted_class, reference = test_set$SPECIES )
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
##   C_SERRULAT        126         12         11         38
##   S_ISOETIFO          1          1          0          0
##   T_HEMPRICH         15          2        186         13
##   Z_CAPIRCOR        172         10          8       2499
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9089          
##                  95% CI : (0.8982, 0.9188)
##     No Information Rate : 0.8242          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6661          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity                    0.40127         0.0400000           0.90732
## Specificity                    0.97806         0.9996742           0.98962
## Pos Pred Value                 0.67380         0.5000000           0.86111
## Neg Pred Value                 0.93533         0.9922380           0.99340
## Prevalence                     0.10149         0.0080802           0.06626
## Detection Rate                 0.04072         0.0003232           0.06012
## Detection Prevalence           0.06044         0.0006464           0.06981
## Balanced Accuracy              0.68967         0.5198371           0.94847
##                      Class: Z_CAPIRCOR
## Sensitivity                     0.9800
## Specificity                     0.6507
## Pos Pred Value                  0.9293
## Neg Pred Value                  0.8741
## Prevalence                      0.8242
## Detection Rate                  0.8077
## Detection Prevalence            0.8691
## Balanced Accuracy               0.8154

We see that our multinomial logistic model has about 91% overall accuracy, which performs a bit worse than random forest. However, the model performs very badly in predicting for S_ISOETIFO.

The model seems to predict T_HEMPRICH the best with 88.7% sensitivity and 98.8% specificity. The model also does not perform well for sensitivity of C_SERRULAT, with only about 41.7% sensitivity.

Again, we can assess our results visually:

# true values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
# predicted values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~predicted_class, type="scatter3d", mode="markers")

For sediment type:

test_set %>% 
    ggplot(aes(SEDIMENT, fill=SPECIES)) + geom_bar(width=.5, position = "dodge")

test_set %>% 
    ggplot(aes(SEDIMENT, fill=predicted_class)) + geom_bar(width=.5, position = "dodge")

For seabed type:

test_set %>% 
    ggplot(aes(TIDAL, fill=SPECIES)) + geom_bar(width=.5, position = "dodge")

test_set %>% 
    ggplot(aes(TIDAL, fill=predicted_class)) + geom_bar(width=.5, position = "dodge")

We can confirm in our visualizations that the logistic regression model failed to predict for S_ISOETIFO.

So, we see that the overall accuracy for multinomial logistic regression model vs random forest model was 95.6% and 90.9%, respectively. However, the overall accuracy stated for the logistic regression model is deceiving since it did not perform well in sensitivity in 2 out of 4 classes.